Skip to content

feat(otel): add token breakdown attributes to conclusion spans#26121

Merged
pelikhan merged 2 commits intomainfrom
copilot/add-token-breakdown-to-spans
Apr 14, 2026
Merged

feat(otel): add token breakdown attributes to conclusion spans#26121
pelikhan merged 2 commits intomainfrom
copilot/add-token-breakdown-to-spans

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 14, 2026

sendJobConclusionSpan emitted only gh-aw.effective_tokens — a cost-weighted aggregate — while the full per-type breakdown already existed in /tmp/gh-aw/agent_usage.json (written by parse_token_usage.cjs) but was never read.

Changes

  • send_otlp_span.cjs: After the gh-aw.effective_tokens block, reads agent_usage.json via the existing readJSONIfExists helper and conditionally pushes four new span attributes (only when value > 0):

    • gh-aw.tokens.input
    • gh-aw.tokens.output
    • gh-aw.tokens.cache_read
    • gh-aw.tokens.cache_write
  • send_otlp_span.test.cjs: New describe("token breakdown enrichment in conclusion span") block covering: all-fields present, file absent, zero-value omission, and invalid JSON — matching the pattern used by the rate-limit enrichment tests.

const agentUsage = readJSONIfExists("/tmp/gh-aw/agent_usage.json") || {};
if (typeof agentUsage.input_tokens === "number" && agentUsage.input_tokens > 0) {
  attributes.push(buildAttr("gh-aw.tokens.input", agentUsage.input_tokens));
}
// … output_tokens, cache_read_tokens, cache_write_tokens

These attributes enable cache-hit-rate panels (cache_read / (cache_read + cache_write)), per-type cost attribution, and fine-grained threshold alerts without requiring step summary HTML.

Read agent_usage.json in sendJobConclusionSpan and emit
gh-aw.tokens.input, gh-aw.tokens.output, gh-aw.tokens.cache_read,
and gh-aw.tokens.cache_write span attributes.

Closes #<issue>

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/5f05f55a-111f-459e-9aa5-34bca00d4a14

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Add token breakdown attributes to conclusion spans feat(otel): add token breakdown attributes to conclusion spans Apr 14, 2026
Copilot AI requested a review from pelikhan April 14, 2026 00:52
@pelikhan pelikhan marked this pull request as ready for review April 14, 2026 00:53
Copilot AI review requested due to automatic review settings April 14, 2026 00:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds per-token-type breakdown attributes to the job conclusion OTLP span by reading the existing /tmp/gh-aw/agent_usage.json, enabling more detailed observability (cache hit rate, per-type cost attribution, and alerting) than the current gh-aw.effective_tokens aggregate alone.

Changes:

  • Enriches sendJobConclusionSpan with four new conditional span attributes: input/output/cache read/cache write tokens.
  • Updates sendJobConclusionSpan JSDoc to document the additional runtime file.
  • Adds a focused test suite covering presence/absence, zero-value omission, and invalid JSON handling for agent_usage.json.
Show a summary per file
File Description
actions/setup/js/send_otlp_span.cjs Reads agent_usage.json and conditionally emits gh-aw.tokens.* attributes on conclusion spans.
actions/setup/js/send_otlp_span.test.cjs Adds tests validating token breakdown enrichment behavior in conclusion spans.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@github-actions github-actions bot mentioned this pull request Apr 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 83/100

Excellent test quality

Metric Value
New/modified tests analyzed 4
✅ Design tests (behavioral contracts) 4 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 3 (75%)
Duplicate test clusters 0
Test inflation detected Yes (108 test lines / 23 prod lines ≈ 4.7:1)
🚨 Coding-guideline violations None

Test Classification Details

Test File Classification Issues Detected
includes all four token breakdown attributes when agent_usage.json is present actions/setup/js/send_otlp_span.test.cjs ✅ Design Happy path only; verifies all 4 OTLP attributes carry correct values
omits all token breakdown attributes when agent_usage.json is absent actions/setup/js/send_otlp_span.test.cjs ✅ Design Edge case: ENOENT; verifies no spurious attributes emitted
omits a token attribute when its value is zero actions/setup/js/send_otlp_span.test.cjs ✅ Design Edge case: zero-value filtering boundary
omits token breakdown attributes when agent_usage.json contains invalid JSON actions/setup/js/send_otlp_span.test.cjs ✅ Design Error case: malformed input; verifies graceful degradation

Flagged Tests — Requires Review

No tests require immediate review. One non-blocking observation is noted below.

i️ Test inflation ratio (informational)

actions/setup/js/send_otlp_span.test.cjs added 108 lines while the production file send_otlp_span.cjs added 23 lines (ratio ≈ 4.7:1, exceeding the 2:1 threshold). This triggers a 10-point score deduction per the rubric, but in context it is expected: each test case requires significant boilerplate (mock setup, fetch stub, env vars, full OTLP response parsing). The coverage is genuine and the overhead is justified by the 4 distinct behavioral scenarios tested.


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests (no Go test files changed)
  • 🟨 JavaScript (*.test.cjs): 4 tests (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All 4 new tests verify observable OTLP span output — the attributes emitted to telemetry backends — under four meaningful scenarios: all tokens present, file absent, zero-value tokens omitted, and invalid JSON tolerated gracefully. The vi.spyOn(fs, "readFileSync") mocking targets external filesystem I/O, which is legitimate per project guidelines.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · ● 536.7K ·

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 83/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 4 new tests are behavioral contracts that verify observable OTLP span attributes under distinct scenarios (happy path, absent file, zero-value filtering, invalid JSON). No coding-guideline violations detected.

@pelikhan pelikhan merged commit c778e52 into main Apr 14, 2026
77 of 78 checks passed
@pelikhan pelikhan deleted the copilot/add-token-breakdown-to-spans branch April 14, 2026 01:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[otel-advisor] add token breakdown (input/output/cache) attributes to conclusion spans

3 participants